Cliques and duplication-divergence network growth 



I. Ispolatov,1 P. L. Krapivsky,2 I. Mazo,1 and A. Yuryev1

1 Ariadne Genomics Inc., Rockville, MD 20850
2 Center for Polymer Studies and Department of Physics, Boston University, Boston, MA 02215

(Dated: February 4, 2008) 

The population of complete subgraphs, or cliques, in a network evolving via duplication-divergence is considered. We find that the number of cliques of each size scales linearly with the size of the network. We also derive a clique population distribution that is in perfect agreement with both the simulation results and the clique statistics of the protein-protein binding network of the fruit fly. In addition, we show that such features as a fat-tailed degree distribution, various rates of average degree growth, and non-averaging, revealed recently only for the particular case of completely asymmetric divergence, are present in the general case of arbitrary divergence.



PACS numbers: 89.75.Hc, 02.50.Cw, 05.50.+q 
I. INTRODUCTION 

The duplication-divergence mechanism [1, 2] of network growth is traditionally used to model protein networks: a duplication of a node is a consequence of the duplication of the corresponding gene, and a divergence, or loss of redundant links or functions, is a consequence of gene mutations [3, 4, 5, 6]. General properties of duplication-divergence growth have recently been studied for probably the simplest version of the duplication-divergence model, the asymmetric divergence [7]. Yet even this simplest model, where links are removed with a certain probability only from the replica node, turned out to have a very rich phenomenology and to reproduce the degree distribution observed in real protein-protein networks surprisingly well. Overall, when the link removal probability is small, the network growth is not self-averaging and the average vertex degree increases algebraically. For larger values of the link removal probability, the growth is self-averaging, the average degree increases very slowly or tends to a constant, and the degree distribution has a power-law tail.

A natural next step in exploring properties of duplication-divergence networks is to consider their modular structure and the distribution of various subgraphs or motifs. Small subgraphs are often considered building blocks of networks; densities of particular subgraphs may tell if a network belongs to a certain "superfamily" [8] or performs specific functions [9]. Abundances of triangles and loops have been studied in the Internet, random and preferential attachment networks, and regular scale-free graphs [10, 11, 12, 13]. Densities of small motifs and cycles centered on a vertex were considered as a function of the vertex degree and clustering coefficient in [14]. In protein-protein networks, highly interconnected subgraphs were found to be well conserved in evolution [15] and to correspond to functional protein modules in living cells [16]. An extreme case of highly interconnected motifs are cliques, or completely connected subgraphs. Cliques have been found in higher than random abundances in protein-protein networks in yeast [16].

In this paper we consider a generalization of the duplication-divergence network growth mechanism, duplication-divergence-heterodimerization. Heterodimerization, or linking a certain number of pairs of target and replica nodes, is essential for clustering and is observed in protein-protein networks [17]. We show that duplication-divergence-heterodimerization produces cliques in numbers very similar to those observed in protein-protein networks.

As in our previous work [7], we again start with the simplest case of completely asymmetric divergence. Yet in real protein networks, apart from special cases of partially asymmetric divergence [18], the divergence is believed to be close to symmetric [19]. It turns out that the asymmetric divergence results for the clique statistics, as well as the previously obtained results for the network growth [7], are qualitatively similar to those in the arbitrary divergence case, where links are removed with given probabilities from both the target and replica nodes.

The paper consists of two principal parts. In the next section we derive the clique abundance distribution for the asymmetric case and compare it to the simulation and experimental results. In Sec. III we generalize these and previously obtained results for the network growth and degree distribution to the arbitrary divergence case. A Discussion and Conclusion section completes the paper.



II. CLIQUES 

Protein-protein networks exhibit a distinct modular structure and contain densely linked neighborhoods or complexes ([16] and references therein). The extreme case of densely linked complexes are cliques, or completely connected subgraphs, where each vertex is connected to all other subset members. Cliques of sizes up to 14 vertices were found in much higher than "random" abundance in the protein binding network of yeast [16]. Yet many large cliques observed in protein networks may be artifacts of specific experimental techniques or even of misinterpretation of the experimental data. For example, there is strong evidence that all cliques of order higher than six in the yeast interaction network [21] considered in [16] result from the "matrix" recording of the experimental data from mass-spectrometry experiments. In such experiments, immunoprecipitation is used to isolate stable protein complexes. Usually a single protein is used as a target for the antibody; binding of the antibody to this protein leads to isolation of the entire complex. However, the precise pairwise binding between proteins in the complex, strictly speaking, remains undefined if the complex contains more than two proteins. Yet in the "matrix" interpretation of the mass-spectrometry experiment, all possible pairwise interactions between proteins in the complex are usually recorded. A well-known example of such erroneous recording is the anaphase-promoting complex. It is reported as an 11-node clique in three different mass-spectrometry high-throughput interaction surveys of the yeast genome and in the MIPS database [21]. The biggest reported clique in the yeast network, the SAGA/TFIID complex [16], is also the result of erroneous "matrix" recording of the data from a co-immunoprecipitation experiment described in [20].

However, the two-hybrid method, which is virtually free of subjective interference and was used to determine the protein binding network of the fruit fly [22], also yields a higher than "random" number of cliques. Specifically, the fly dataset contains 1405 triads, 35 4-cliques, and one 5-clique, while a randomly rewired graph of the fly dataset contains only 1147 triads and 8 4-cliques [23]. Here and below, the lower-order cliques that comprise the higher-order ones (each clique with $j$ vertices, or "$j$-clique", contains $j$ cliques with $j - 1$ vertices, which can be obtained by eliminating one of the $j$ vertices) are counted along with the non-trivial cliques. The number of only non-trivial cliques is slightly lower; the fly dataset contains 1297 non-trivial triads, 30 4-cliques, and one 5-clique.
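The counting convention described above can be illustrated with a short Python sketch. It is not the procedure used for the fly dataset, only a minimal brute-force enumeration under assumed names (`count_cliques`, an adjacency dictionary `adj` with integer vertex labels). Every $j$-clique is counted, including those contained in larger cliques.

```python
def count_cliques(adj, max_size=6):
    """Count all j-cliques (j >= 2) in an undirected graph.

    adj maps each vertex to the set of its neighbors. Cliques contained in
    larger cliques are included, matching the convention in the text.
    """
    counts = {}
    # Start from links (2-cliques), each stored once as a sorted tuple.
    current = {(u, v) for u in adj for v in adj[u] if u < v}
    size = 2
    while current and size <= max_size:
        counts[size] = len(current)
        bigger = set()
        for clique in current:
            # Vertices adjacent to every member of the clique.
            common = set.intersection(*(adj[v] for v in clique))
            for w in common:
                if w > clique[-1]:  # extend in increasing order: no duplicates
                    bigger.add(clique + (w,))
        current = bigger
        size += 1
    return counts


# Complete graph on five vertices (a single 5-clique).
k5 = {i: {j for j in range(5) if j != i} for i in range(5)}
print(count_cliques(k5))
```

For the complete graph on five vertices this convention gives 10 links, 10 triads, five 4-cliques, and one 5-clique.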

Is such a high concentration of large cliques caused by an evolutionary pressure that specifically favors big cliques, or by some stochastic mechanism of network evolution? Evidently, simple duplication-divergence network growth never produces even a single triad, as new duplicates are never linked to their ancestors [7]. Random mutations, or rewiring of some links, will give rise to a certain number of cliques, yet their abundance will be much lower than the experimentally observed one [23]. However, in [17] it was concluded that links between paralogs (recently duplicated pairs of proteins) are significantly more common than if such links appeared by random mutations. Most of these paralogous links are formed when a self-interacting protein (homodimer) is duplicated [17], thus giving rise to a pair of interacting heterodimers. While after divergence certain pairs of heterodimers lose their ability to interact, some paralogs retain their propensity to heterodimerize. In the following we show that simple duplication-divergence network growth complemented with heterodimerization of some pairs of duplicates does explain the observed abundance of cliques without invoking any evolutionary pressure.

FIG. 1: A sketch of a duplication event in which a new triad is formed with a heterodimerization link. Solid lines correspond to existing links, the dotted line is a heterodimerization link, established with probability $P$, and dashed lines denote links inherited with probability $\sigma$.

The duplication-divergence-heterodimerization process consists of duplication-divergence, previously introduced in [7],

• Duplication. A randomly chosen target node is duplicated, that is, its replica is introduced and connected to each neighbor of the target node.

• Divergence. Each link emanating from the replica is activated with probability $\sigma$ (this mimics link disappearance during divergence).

and heterodimerization,

• Heterodimerization. The target and replica nodes are linked with probability $P$. This mimics the probability that the target node is a dimer and that the propensity for dimerization is preserved during divergence.

Similarly to the "pure" duplication-divergence growth [7], the replica is preserved if at least one link is established; otherwise the attempt is considered a failure and the network does not change.
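The three rules above can be sketched in a few lines of Python. This is only an illustrative implementation, not the simulation code behind the results reported here; the function names, the integer vertex labels, and the seed graph (a single link) are assumptions.

```python
import random


def ddh_step(adj, sigma, p_hd, rng):
    """One duplication-divergence-heterodimerization attempt (asymmetric case).

    adj: dict vertex -> set of neighbors, vertices labeled 0, 1, 2, ...
    Returns True if the replica was kept, False if the attempt failed.
    """
    target = rng.randrange(len(adj))
    # Duplication + divergence: each copied link survives with prob. sigma.
    links = {v for v in adj[target] if rng.random() < sigma}
    # Heterodimerization: link target and replica with probability p_hd.
    if rng.random() < p_hd:
        links.add(target)
    if not links:
        return False  # no link established: the network does not change
    replica = len(adj)
    adj[replica] = links
    for v in links:
        adj[v].add(replica)
    return True


def grow(n_vertices, sigma, p_hd, seed=0):
    """Grow a network to n_vertices from an assumed seed of one link."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}
    while len(adj) < n_vertices:
        ddh_step(adj, sigma, p_hd, rng)
    return adj


network = grow(1000, sigma=0.38, p_hd=0.03)
mean_degree = sum(len(nb) for nb in network.values()) / len(network)
print(len(network), mean_degree)
```

Failed attempts are simply retried, so the number of attempts needed to reach a given size exceeds the number of vertices added.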

Let us first consider the evolution of the population of triads, or 3-cliques. Two processes that give rise to new triads are illustrated in Figs. 1 and 2. During the first process, a target vertex 1, initially linked to vertices 2 and 3, is duplicated to produce a new vertex 4. The resulting pair of duplicates 1 and 4 is linked with probability $P$. In addition, links 4-2 and 4-3 are inherited with probability $\sigma$ each. As a result of this process, two new triads 1-4-2 and 1-4-3 are formed, each with probability $P\sigma$. In the second process (Fig. 2), a new triad is produced from an existing one when one of its vertices (vertex 1) is duplicated. The new triad is formed only if both links 4-2 and 4-3 survive divergence, which happens with probability $\sigma^2$.

FIG. 2: A sketch of a duplication event in which a new triad is formed by duplicating an existing one. Solid lines correspond to existing links and dashed lines denote links, each inherited with probability $\sigma$.

Correspondingly, the rate equation for the increase in the number of triads $C_3$ per duplication-divergence-heterodimerization step contains two terms,

$$\Delta C_3 = \sigma P\,\frac{2L}{N} + \sigma^2\,\frac{3C_3}{N}, \qquad (1)$$

where $L$ and $N$ are the numbers of links and vertices in the network. The fraction $2L/N$ in the first term is the average number of links picked up for a potential triad (which is also equal to the average degree $\langle d \rangle$). The factor 3 in the second term indicates that each of the three vertices in an existing triad can be picked as the target vertex for duplication.

Considering links as 2-cliques, the first term in Eq. (1) can be interpreted as describing the creation of a 3-clique from a lower-order 2-clique. With this interpretation, Eq. (1) can be generalized to describe the evolution of the population of cliques of arbitrary order $j$,

$$\frac{\Delta C_j}{\Delta N} = \sigma^{j-2} P\,\frac{(j-1)\,C_{j-1}}{\nu N} + \sigma^{j-1}\,\frac{j\,C_j}{\nu N}. \qquad (2)$$

Here $\nu \le 1$ is the increment in the number of vertices per duplication step. In the following we focus on the biologically relevant regime $0 < \sigma < 1/2$, where the average degree $\langle d \rangle$ is constant or almost constant [7]. In this regime $\nu = 2\sigma$, and assuming scaling for $C_j$, $C_j = N c_j$, one obtains a recurrence relation for the rescaled $j$-clique abundance,

$$c_j = \frac{(j-1)\,P\,\sigma^{j-3}}{2 - j\,\sigma^{j-2}}\; c_{j-1}. \qquad (3)$$

For large $j$ the second term in the denominator becomes subdominant,

$$c_j \sim (j-1)!\;\sigma^{(j-3)(j-2)/2}\,(P/2)^{j-2}. \qquad (4)$$

TABLE I: Number of $j$-cliques in networks with $N = 6954$ vertices and $L = 20435$ links: $C_j^{\rm fly}$ - fruit fly protein-protein binding network, $C_j^{\rm sim}$ - simulations with $\sigma = 0.38$ and $P = 0.03$, and $C_j^{\rm th}$ - Eq. (3) prediction for the same $\sigma$ and $P$.



It follows that the relative population of large cliques decays faster than exponentially. This renders large cliques highly improbable in networks of biologically relevant size, $N \sim 10^4$.

To check this analytical prediction and to see if the proposed duplication-divergence-heterodimerization model explains the observed population of cliques, we performed the following numerical simulation. As in [7], we fix $\sigma = 0.38$ so that the average degree is equal to that of the fly dataset, where $\langle d \rangle \approx 5.9$ for $N = 6954$ proteins [22]. We select $P = 0.03$ so that the number of triads in the simulated network is also similar to that in the fly dataset, and count the number of 4- and 5-cliques in the resulting network. The theoretical $C_j$ are computed for the same $\sigma$ and $P$, taking into account that $c_2 = \langle d \rangle / 2$. The simulation results $C_j^{\rm sim}$ averaged over 2000 network realizations, the computed $C_j^{\rm th}$, and the clique abundances in the fly dataset $C_j^{\rm fly}$ are shown in Table I. The agreement between the experimental dataset, simulations, and Eq. (3) is surprisingly good, especially given the fact that for $\sigma = 0.38$, $\langle d \rangle \approx {\rm const}$ only approximately [7].
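The theoretical abundances can be evaluated with a few lines of Python. The sketch below assumes that the recurrence for the rescaled abundance reads $c_j = (j-1) P \sigma^{j-3} c_{j-1} / (2 - j\sigma^{j-2})$ and starts it from $c_2 = L/N$; the function name and the choice of starting value are illustrative assumptions rather than a description of how Table I was produced.

```python
def clique_densities(sigma, p_hd, c2, j_max=6):
    """Rescaled clique abundances c_j = C_j / N for j = 2..j_max.

    Uses the assumed recurrence
        c_j = (j - 1) * P * sigma**(j - 3) * c_{j-1} / (2 - j * sigma**(j - 2)),
    which requires nu = 2 * sigma (regime of nearly constant average degree).
    """
    c = {2: c2}
    for j in range(3, j_max + 1):
        numerator = (j - 1) * p_hd * sigma ** (j - 3) * c[j - 1]
        c[j] = numerator / (2 - j * sigma ** (j - 2))
    return c


N, L = 6954, 20435          # size of the fly dataset quoted in the text
densities = clique_densities(sigma=0.38, p_hd=0.03, c2=L / N)
for j in range(3, 7):
    print(j, N * densities[j])   # predicted number of j-cliques
```

With these inputs the predicted numbers of triads and 4-cliques come out close to the fly counts quoted in Sec. II (1405 and 35), and the values drop steeply with $j$, as expected from the faster-than-exponential decay.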



III. SYMMETRIC DIVERGENCE 

In this section we generalize the results obtained in [7] and above for the case of completely asymmetric divergence to the arbitrary divergence case. The arbitrary divergence model is defined as follows:

1. Duplication. A randomly chosen target node is duplicated, that is, its replica is introduced and connected to all neighbors of the target node.

2. Divergence. Each link emanating from either the target or the replica node is independently removed with probability $1 - \sigma_1$ and $1 - \sigma_2$, respectively. This mimics the disappearance of links during divergence from initially indistinguishable target and replica nodes. Vertices that lose all their links during this process (this may include the target and replica vertices as well as their neighbors) are discarded.

Unlike the asymmetric duplication-mutation models, the symmetric growth model may generate a network consisting of more than one disconnected component. Vazquez and co-workers [4] investigated a symmetric model which differs only slightly from the fully symmetric version ($\sigma_1 = \sigma_2$) of our model.
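A minimal Python sketch of the arbitrary divergence step, together with a component count, is given below. It is an illustration under assumptions (function names, a single-link seed graph, and a restart from the seed if the whole network is discarded), not the code used for the results in this section.

```python
import random


def symmetric_step(adj, s1, s2, replica, rng):
    """One duplication step with arbitrary divergence.

    Links of the target survive with probability s1, links of the replica
    with probability s2; vertices left without links are discarded.
    """
    target = rng.choice(list(adj))
    neighbors = list(adj[target])
    adj[replica] = set()
    for v in neighbors:
        if rng.random() < s2:          # replica keeps its copy of the link
            adj[replica].add(v)
            adj[v].add(replica)
        if rng.random() >= s1:         # target loses the original link
            adj[target].discard(v)
            adj[v].discard(target)
    for v in [target, replica] + neighbors:
        if not adj[v]:
            del adj[v]


def grow_symmetric(n_vertices, s1, s2, seed=0):
    rng = random.Random(seed)
    adj, next_id = {}, 0
    while len(adj) < n_vertices:
        if not adj:                    # assumed seed: a single link
            adj = {next_id: {next_id + 1}, next_id + 1: {next_id}}
            next_id += 2
        symmetric_step(adj, s1, s2, next_id, rng)
        next_id += 1
    return adj


def component_sizes(adj):
    """Sizes of connected components, largest first (depth-first search)."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sizes.append(size)
    return sorted(sizes, reverse=True)


net = grow_symmetric(500, 0.8, 0.8)
sizes = component_sizes(net)
print(len(sizes), sizes[0] / len(net))
```

The two printed numbers are the number of components and the fraction of vertices in the largest component for one realization.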



A. Growth law 

As in [7], the increment in the number of links $L$ during a duplication step is

$$\frac{\Delta L}{\Delta N} = \frac{2L\,(\sigma_1 + \sigma_2 - 1)}{\nu N}, \qquad (5)$$

where $N$ is the number of vertices, $2L/N = \langle k \rangle$ is the average number of neighbors, or the average degree, and $0 < \nu \le 1$ is the increment in the number of vertices per step. Assuming that for a large network $\nu$ does not depend on the network size $N$, we obtain

$$L(N) \sim N^{2(\sigma_1 + \sigma_2 - 1)/\nu}. \qquad (6)$$

As in the asymmetric case, there exist three distinct regimes:

• Since at a duplication step the number of vertices cannot increase by more than one, $\nu \le 1$, and for $\sigma_1 + \sigma_2 > 3/2$ the growth of $L(N)$ is superlinear. The average degree grows as a power law of the network size, and for sufficiently large networks the probability to eliminate all the links, and therefore not to add a vertex at a duplication step, becomes negligible. Hence for large networks $\nu \to 1$ and

$$L(N) \sim N^{2(\sigma_1 + \sigma_2 - 1)}. \qquad (7)$$

• For $\sigma_1 + \sigma_2 < 3/2$ and $\sigma_1 > \sigma_1^*$, $\sigma_2 > \sigma_2^*$ (where the lower bounds $\sigma_i^*$ will be determined below), we observe that the average degree increases logarithmically and $L \sim N \ln N$.

• Since only linked vertices are counted, the average degree cannot decrease below unity. Hence even for small link retention probabilities, $1 < \sigma_1 + \sigma_2$ and $\sigma_1 < \sigma_1^*$, $\sigma_2 < \sigma_2^*$, the growth of $L$ is linear, $L \sim N$, and the average degree saturates to a constant.



B. Degree distribution 

As in [7], the degree distribution $N_k$ is described by the following rate equation,

Here the first three terms describe the gain of two new degrees of the duplicated vertices and the loss of an old degree of the target vertex, while the fourth term accounts for a change in the number of degrees of a neighbor of a target vertex. Substituting $N_k \propto N k^{-\gamma}$ and using $\nu = 2(\sigma_1 + \sigma_2 - 1)$ (which follows from Eq. (6)), we obtain

$$\sigma_1^{\gamma-1} + \sigma_2^{\gamma-1} + (\sigma_1 + \sigma_2 - 1)(\gamma - 1) + 1 - 2(\sigma_1 + \sigma_2) = 0. \qquad (9)$$

This equation has a trivial solution $\gamma' = 2$ and a non-trivial solution $\gamma(\sigma_1, \sigma_2)$, which intersect at $(\sigma_1^*, \sigma_2^*)$ that satisfy the equation

$$\sigma_1^* \ln \sigma_1^* + \sigma_2^* \ln \sigma_2^* + \sigma_1^* + \sigma_2^* - 1 = 0. \qquad (10)$$

An important example is the symmetric case, $\sigma_1^* = \sigma_2^* \approx 0.72985$; in the asymmetric case $\sigma_1 = 1$ and $\sigma_2^* = 1/e \approx 0.36788$. The resulting exponent $\gamma$ for the symmetric case is plotted in Fig. 5. The degree distribution measured in simulations for $1/2 < \sigma < \sigma^*$ indeed follows the predicted power-law asymptotics, Fig. 6.

FIG. 3: (Color online) The average node degree $\langle k \rangle$ vs $N$ for (bottom to top) the completely symmetric network growth with $\sigma_1 = \sigma_2 = 0.6$, 0.75, 0.85. Solid lines are the corresponding best fits: $\langle k \rangle = {\rm const}$ for $\sigma_1 = \sigma_2 = 0.6$, $\langle k \rangle \sim N^{0.14}$ or $\langle k \rangle \sim \ln N$ for $\sigma_1 = \sigma_2 = 3/4$, and $\langle k \rangle \sim N^{0.41}$ for $\sigma_1 = \sigma_2 = 0.85$ ($\langle k \rangle \sim N^{0.4}$ from Eq. (7)). The results are averaged over 100 network realizations.

FIG. 4: (Color online) Scaling of the degree distribution in the networks of $N = 200$, $N = 2000$, and $N = 20000$ nodes with $\sigma_1 = \sigma_2 = 0.85$.

FIG. 5: The degree distribution exponent $\gamma(\sigma)$ for the symmetric divergence from Eq. (9); $\gamma \approx 1/(2\sigma - 1)$ for $\sigma \to 1/2 + 0$.

FIG. 6: The degree distribution $n_k$ for symmetric divergence, $\sigma_1 = \sigma_2 = 0.675$. The dashed line is the predicted power-law asymptotics with the exponent $\gamma(0.675) \approx 4.3$.
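The threshold values and the exponent quoted above can be checked numerically. The Python sketch below assumes the exponent equation in the form $\sigma_1^{\gamma-1} + \sigma_2^{\gamma-1} + (\sigma_1+\sigma_2-1)(\gamma-1) + 1 - 2(\sigma_1+\sigma_2) = 0$ and the threshold condition $\sigma_1 \ln \sigma_1 + \sigma_2 \ln \sigma_2 + \sigma_1 + \sigma_2 - 1 = 0$ (the point where the non-trivial root merges with $\gamma = 2$); the helper names and the bisection brackets are illustrative choices.

```python
import math


def bisect(f, lo, hi, tol=1e-12):
    """Root of f on [lo, hi] by bisection; f(lo), f(hi) must differ in sign."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def threshold_condition(s1, s2):
    # Assumed form of the condition that fixes (sigma_1*, sigma_2*).
    return s1 * math.log(s1) + s2 * math.log(s2) + s1 + s2 - 1


def sigma_star_symmetric():
    return bisect(lambda s: threshold_condition(s, s), 0.5, 1.0)


def sigma_star_asymmetric():
    # sigma_1 = 1: links are removed only from the replica.
    return bisect(lambda s: threshold_condition(1.0, s), 0.1, 0.9)


def gamma_symmetric(s):
    """Non-trivial exponent for sigma_1 = sigma_2 = s, 1/2 < s < sigma*.

    Not reliable for s very close to sigma*, where the root approaches 2.
    """
    def f(g):
        return 2 * s ** (g - 1) + (2 * s - 1) * (g - 1) + 1 - 4 * s
    return bisect(f, 2.0 + 1e-6, 1.0 / (2 * s - 1) + 10.0)


print(sigma_star_symmetric())    # compare with 0.72985 in the text
print(sigma_star_asymmetric())   # compare with 1/e
print(gamma_symmetric(0.675))    # compare with the exponent quoted for Fig. 6
```

The upper bracket in `gamma_symmetric` uses the large-exponent estimate $\gamma \approx 1/(2\sigma - 1)$, shifted upward so that the function is positive there.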

A summary of results for duplication-divergence of arbitrary symmetry is presented in Table II.



C. Cliques 

Similarly to the asymmetric divergence considered above, to generate cliques one needs to add heterodimerization to the pure duplication and divergence. Hence we assume that the target and replica nodes are linked with probability $P$.



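A generalization of Eq. (2) reads

$$\frac{\Delta C_j}{\Delta N} = \frac{(j-1)\,C_{j-1}\,P\,(\sigma_1\sigma_2)^{j-2}}{\nu N} + \frac{j\,C_j\,(\sigma_1\sigma_2)^{j-1}}{\nu N} - \frac{j\,C_j\,(1-\sigma_1^{j-1})(1-\sigma_2^{j-1})}{\nu N}. \qquad (11)$$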

Since the creation of a new clique requires that all links emanating from both the target and replica vertices survive divergence, in the first two terms $\sigma$ is replaced by $\sigma_1\sigma_2$; the third term accounts for the loss of $j$-cliques due to the disappearance of at least one link from both the target and the replica nodes. Following the procedure for the asymmetric case and taking into account that in the scaling regime, where $1 < \sigma_1 + \sigma_2 < 3/2$, $\nu = 2(\sigma_1 + \sigma_2 - 1)$, we obtain the recurrence relation (an analog of Eq. (3)),

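$$c_j = \frac{(j-1)\,P\,(\sigma_1\sigma_2)^{j-2}}{2(\sigma_1+\sigma_2-1) - j\,(\sigma_1^{j-1} + \sigma_2^{j-1} - 1)}\; c_{j-1}. \qquad (12)$$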

We check this prediction for the completely symmetric case $\sigma_1 = \sigma_2 = \sigma$, again using the fly dataset [22] for reference. The correct average degree and number of triads are obtained when $\sigma \approx 0.725$ and $P \approx 0.0475$. The experimental, simulation, and theoretical results, shown in Table III, are again in very good agreement.
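For orientation, the recurrence for arbitrary divergence can be evaluated in the same way as in the asymmetric case. The sketch below rests on an assumption about its form, namely $c_j = (j-1) P (\sigma_1\sigma_2)^{j-2} c_{j-1} / [2(\sigma_1+\sigma_2-1) - j(\sigma_1^{j-1} + \sigma_2^{j-1} - 1)]$, which follows if a $j$-clique is gained when all $j-1$ clique links of both duplicates survive and lost when both duplicates lose at least one of them. The function name and the starting value $c_2 = L/N$ are likewise illustrative.

```python
def clique_densities_arbitrary(s1, s2, p_hd, c2, j_max=6):
    """Rescaled clique abundances c_j for arbitrary divergence (assumed form).

    s1, s2: link retention probabilities of the target and the replica.
    Valid only in the scaling regime where nu = 2 * (s1 + s2 - 1).
    """
    nu = 2 * (s1 + s2 - 1)
    c = {2: c2}
    for j in range(3, j_max + 1):
        gain = (j - 1) * p_hd * (s1 * s2) ** (j - 2)
        net_copy = s1 ** (j - 1) + s2 ** (j - 1) - 1
        c[j] = gain * c[j - 1] / (nu - j * net_copy)
    return c


N, L = 6954, 20435          # size of the fly dataset quoted in the text
c_sym = clique_densities_arbitrary(0.725, 0.725, 0.0475, L / N)
for j in range(3, 7):
    print(j, N * c_sym[j])
```

Setting `s1 = 1` removes divergence of the target, and the expression then reduces to the asymmetric recurrence with $\nu = 2\sigma$, which is a useful consistency check on the assumed form.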

D. Integrity of the network 

For symmetric divergence, we measure the number of components and the size of the largest component for networks grown with various $\sigma_1 = \sigma_2 = \sigma$. The results for networks of the size of the fruit fly dataset, $N = 6954$, are presented in Table IV.

It follows that for $1/2 < \sigma < \sigma^*$ the grown network consists of many fairly small components, while for $\sigma^* < \sigma$ there is usually one or a few large components and several small ones. Intuitively it is clear that if the average degree grows, even slowly, the probability to split the network into many parts becomes small.

TABLE II: The behavior of the duplication-divergence network of arbitrary symmetry for different values of the probabilities to preserve a link, $\sigma_1$ and $\sigma_2$. Here $L(N)$ is the average number of links for a given number of nodes $N$, and $n_k$ is the average fraction of nodes of degree $k$. $\sigma_i^*$, $i = 1, 2$, are the solutions of Eq. (10); $\gamma(\sigma_1, \sigma_2)$ is given by Eq. (9).

 j    C_j^fly    C_j^sim            C_j^th
 3    1405       1353 ± 9           1377
 4    35         28 ± 1             28
 5    1          0.24 ± 0.03        0.24
 6    0          0.0025 ± 0.0016    0.0011

TABLE III: Number of $j$-cliques in networks with $N = 6954$ vertices and $L = 20435$ links: $C_j^{\rm fly}$ - fruit fly protein-protein binding network, $C_j^{\rm sim}$ - simulation of symmetric divergence with $\sigma_1 = \sigma_2 = 0.725$ and $P = 0.0475$, and $C_j^{\rm th}$ - Eq. (12) prediction for the same $\sigma$ and $P$. Simulation results are averaged over 2000 network realizations.

 σ        n_c            N_L/N
 0.8      1.1 ± 0.01     99 ± 0.2%
 0.725    8.4 ± 0.2      92 ± 0.4%
 0.65     232 ± 1        33 ± 1%
 0.6      835 ± 1.4      2.7 ± 0.03%

TABLE IV: Number of components $n_c$ and the number of vertices in the largest component normalized by the network size, $N_L/N$, in the duplication-symmetric divergence networks for various $\sigma_1 = \sigma_2 = \sigma$. All networks are grown to the fly dataset size, $N = 6954$; the results are averaged over 1000 realizations.

A theoretical prediction for the size of the giant component exists only for the Erdős-Rényi random graph [24]: when the average degree scales logarithmically with the number of vertices, i.e., $\langle d \rangle = p \ln N$, the total number of vertices that do not belong to the giant component scales as $N^{1-p}$ for $p < 1$, while for $p > 1$ the giant component engulfs the entire system. It turns out that for the same number of vertices and links, the completely random linking of the Erdős-Rényi graph keeps more vertices in the giant component than the corresponding duplication-symmetric divergence network. Indeed, for the parameters corresponding to the fly dataset, $\sigma_1 = \sigma_2 = 0.725$, $N = 6954$, $\langle d \rangle \approx 5.9$, and $p = 0.667$, the number of vertices not belonging to the giant component is $6954 \times 0.08 \approx 556$ (see Table IV). Yet the Erdős-Rényi graph with the same number of vertices and links has only $\approx 6954^{0.333} \approx 19$ vertices outside of its giant component. This happens mainly because in our duplication-divergence growth model, once a component is split from the giant component, it never reconnects. If such a separation happens at an early stage of the network growth, the separated component may grow to a significant size, thus leaving many vertices outside of the giant component. On the contrary, at each step of the Erdős-Rényi growth, any two components can be united with a random link. This makes the coexistence of two or more large components very improbable.
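The Erdős-Rényi estimate quoted above is simple arithmetic and can be reproduced directly. The short Python sketch below uses a hypothetical helper name and takes $N = 6954$ and $\langle d \rangle \approx 5.9$ from the text.

```python
import math


def er_outside_giant(n, mean_degree):
    """Return p = <d> / ln N and the estimate N**(1 - p), meaningful for p < 1."""
    p = mean_degree / math.log(n)
    return p, n ** (1 - p)


p, outside = er_outside_giant(6954, 5.9)
print(p, outside)          # random-graph estimate
print(6954 * 0.08)         # duplication-divergence network, from Table IV
```

Since $N^{1-p} = N e^{-\langle d \rangle}$, the estimate is very sensitive to the average degree, which is why the random graph leaves so few vertices outside its giant component.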

IV. DISCUSSION AND CONCLUSION 

In the previous sections the following conclusions on the clique abundances and growth laws of the duplication-divergence-heterodimerization networks have been made:

• We showed that the duplication-divergence network growth model, complemented with heterodimerization links between duplicates, correctly describes the statistics of cliques in biologically observed protein-protein networks. We derived an expression for the clique population distribution that correctly describes the clique abundances in the duplication-divergence-heterodimerization networks.

• Generalizing the results obtained for the completely asymmetric divergence in [7], we demonstrated that similar regimes, such as the presence and lack of self-averaging, growth and saturation of the average degree, and scaling and a fat tail in the degree distribution, exist in the general duplication-divergence case as well. In addition, the clique density distribution is generalized to the arbitrary divergence scenario.

The heterodimerization links are not taken into account in our description of the network growth and degree distribution. Despite their crucial role in the network topology and clique formation, they constitute only about 1% of all links and do not contribute significantly to the degrees of most vertices. For the link inheritance and heterodimerization probabilities $\sigma$ and $P$ corresponding to the fly dataset, the resulting number of heterodimeric links in a network of the size of the fly dataset is $L_{\rm hd} \approx PN/(2\sigma) \approx 270$. This is somewhat higher than the observed number of links between pairs of recently duplicated (paralogous) proteins, $L_{\rm hd}^{\rm fly} = 142$ [17]. The main reason for this discrepancy is that in our simulation all heterodimeric links are counted, while in the real protein network one can reliably identify only the pairs of recently duplicated proteins.
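The estimate of the number of heterodimeric links can be reproduced with a few lines of Python. The sketch assumes that reaching $N$ vertices takes about $N/\nu = N/(2\sigma)$ duplication attempts, each of which creates a heterodimerization link with probability $P$; the function name is hypothetical.

```python
def heterodimer_links(n, sigma, p_hd):
    """Expected number of heterodimerization links, P * N / (2 * sigma)."""
    attempts = n / (2 * sigma)     # assumed: nu = 2 * sigma vertices per attempt
    return p_hd * attempts


l_hd = heterodimer_links(6954, sigma=0.38, p_hd=0.03)
print(l_hd)                 # roughly 270, as quoted in the text
print(l_hd / 20435)         # fraction of all links in the fly-sized network
```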

In the case of not completely asymmetric divergence, when links can disappear from both the target and replica nodes, a network may fragment into several components. Yet biological protein networks are believed to be connected to ensure their functionality. Hence during in vivo divergence the steps that lead to breaking the network into isolated components are excluded due to evolutionary pressure. Our probabilistic network growth model does not take any evolutionary pressure into account. However, since for sufficiently high link retention probabilities the resulting network consists of one or very few large components, the number of link eliminations that have to be evolutionarily overridden is small. Hence most of the properties of the probabilistically grown graphs should be similar to those of the realistic evolutionary single-component networks. As the link inheritance probabilities $\sigma_i$ decrease and the number of network components grows, the number of link removal steps that have to be evolutionarily overridden becomes large. Consequently, the probabilistic multi-component network becomes less similar to the real single-component one.

As we mentioned in Section II, we selected the fly dataset as an example because it is the least subjective one. Other known protein-protein networks, such as those for yeast, worm, and human, do contain parts of data that are results of the "matrix" recording of the experimental data from immunoprecipitation experiments. These datasets contain a higher number of large cliques, which can be attributed to this data interpretation. In principle, the clique population distribution derived here can be used to verify and filter the experimental datasets, removing the erroneously recorded large cliques.

In a recent publication, Middendorf et al. [25] compared topological properties of the fly dataset to those of networks grown by several mechanisms, such as different versions of the duplication-mutation model and preferential attachment. It was found that a duplication-mutation-complementation network provides the best fit to the fly dataset. The duplication-mutation-complementation network growth model is very close to the duplication-divergence-heterodimerization model studied here. Complementation is equivalent to heterodimerization; the only difference between the two models is in the way the links are deleted during divergence (or mutation): unlike in our model, in [25] each neighbor remains connected to at least one of the two duplicates. Thus we confirmed the conclusion made in [25] that practically all considered properties of protein-protein networks are very well described by the duplication-divergence-heterodimerization model.

Finally, a few words on the importance of heterodimerization links in clique formation. An alternative way to connect paralogs, instead of heterodimerization, is to link them randomly by "mutation" links. In this case the probability to establish a heterodimeric link $P$ has to be replaced by the probability that a mutation link, emanating from a target node, selects the replica node out of $N$ network nodes. This probability is equal to $M/N$, where $M$ is the number of mutation links established at each duplication step. In the example of the fruit fly dataset, where $P = 0.03$ and $N = 6954$, one needs $M = NP \approx 209$ random links at each step to form the correct number of triads and higher cliques. Obviously, the mutation scenario, which requires so many additional links, is completely ruled out by, for example, the average degree constraint.